Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian
نویسندگان
چکیده
This paper describes the La Repubblica corpus, currently being developed at the SSLMIT of the University of Bologna. The corpus is a very large collection of newspaper text, currently amounting to 175 million words, but expected to grow to 400 million before the end of 2004. When completed, it will contain all the articles published between 1985 and 2000 by the national daily La Repubblica. The paper discusses the techniques used to extract the text, tokenize it and annotate it (basic TEI annotation, POS tagging, genre/topic categorization), it presents examples of how it can be used, and gives details of the ways in which interested users can access it. The paper concludes with a discussion of current and future developments, and of weak and strong points of this resource.
منابع مشابه
EMOCause: An Easy-adaptable Approach to Extract Emotion Cause Contexts
In this paper we present a method to automatically identify linguistic contexts which contain possible causes of emotions or emotional states from Italian newspaper articles (La Repubblica Corpus). Our methodology is based on the interplay between relevant linguistic patterns and an incremental repository of common sense knowledge on emotional states and emotion eliciting situations. Our approa...
متن کاملSequenze N+pN (Nome Comune + Nome Proprio): Descrizione Linguistica da un Corpus dell'Italiano (N+pN (Common Noun + Proper Noun) Sequences: Linguistic Description from an Italian Corpus)
English. This paper describes the most important N+pN (noun + proper noun) structures in Italian from the corpus of La Repubblica 2002-2005 Italiano. Il contributo descrive le strutture N+pN più significative estratte dal corpus de La Repubblica 2002-2005.
متن کاملDevelopment of a Corpus Workbench for the METU Turkish Corpus
We will introduce a corpus workbench designed and implemented for the METU Turkish Corpus. The workbench design introduces a number of useful features and the workbench itself is basically usable with any TEI and XML compliant corpus, provided that it can be indexed in the format required by the workbench.
متن کاملTEI P5 as an XML Standard for Treebank Encoding∗
The aim of the paper is to show that a subset of Text Encoding Initiative Guidelines is a reasonable choice as a standard for stand-off XML encoding of syntactically annotated corpora. The proposed TEI schema — actually employed in the National Corpus of Polish — is compared to other such candidate standards, including TIGER-XML, SynAF and PAULA.
متن کاملA Corpus of Textual Revisions in Second Language Writing
This paper describes the creation of the first large-scale corpus containing drafts and final versions of essays written by non-native speakers, with the sentences aligned across different versions. Furthermore, the sentences in the drafts are annotated with comments from teachers. The corpus is intended to support research on textual revision by language learners, and how it is influenced by f...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2004